Universal Character Set characters

The Unicode Consortium (UC) and the International Organisation for Standardisation (ISO) collaborate on the Universal Character Set (UCS). The UCS is an international standard that maps characters used in natural languages (as opposed to programming languages, for instance) to numeric, machine-readable values. By creating this mapping, the UCS enables computer software vendors to interoperate and to transmit UCS-encoded text strings from one to another.

ISO maintains the basic mapping of characters from character name to code point. The terms character and code point are often used interchangeably. When a distinction is made, however, a code point refers to the integer assigned to the character: what one might think of as its address. While a character in ISO 10646 consists of the combination of the code point and its name, Unicode adds many other properties to the character set. Together, these properties further define each character.
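The distinction between a code point, a name and the additional Unicode properties can be seen with Python's standard `unicodedata` module. This is only an illustrative sketch: the helper name `describe` and the choice of properties shown are assumptions made for this example, not part of either standard.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect the code point, the name and two Unicode properties of one character.

    `describe` is a hypothetical helper written for this example.
    """
    cp = ord(ch)  # the code point: the integer "address" of the character
    return {
        "code_point": cp,
        "u_notation": f"U+{cp:04X}",                   # conventional U+XXXX spelling
        "name": unicodedata.name(ch),                  # the character name
        "general_category": unicodedata.category(ch),  # a property added by Unicode
        "bidi_class": unicodedata.bidirectional(ch),   # another Unicode property
    }


info = describe("A")
print(info)
```

The code point and the name correspond to the basic mapping; the last two entries are examples of the properties that Unicode layers on top of it.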

In addition to the UCS, Unicode also provides other implementation details such as:

  1. transcoding mappings between UCS and other character sets
  2. different collations of characters and character strings for different languages
  3. an algorithm for laying out bidirectional text, where text on the same line may shift between left-to-right and right-to-left
  4. a case folding algorithm
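Two of the items above, transcoding and case folding, can be tried directly in Python, whose built-in codecs and `str.casefold` are assumed here to be an adequate stand-in for the mappings described. The sample word and the variable names are chosen only for illustration.

```python
text = "Straße"  # sample word chosen for this sketch

# Transcoding: the same characters expressed in two different encodings.
latin1_bytes = text.encode("latin-1")  # legacy single-byte character set
utf8_bytes = text.encode("utf-8")      # a UCS encoding form

# Round trip: legacy bytes -> UCS characters -> legacy bytes again.
round_trip = latin1_bytes.decode("latin-1").encode("latin-1")

# Case folding: a caseless form intended for comparison, not for display.
folded = text.casefold()

print(latin1_bytes, utf8_bytes, folded)
```

Case folding differs from simple lower-casing: `"Straße".lower()` keeps the "ß", while the folded form can be compared directly with the folded form of "STRASSE".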

Computer software end users enter these characters into programs through various input methods, such as a keyboard or a graphical character palette.

Divisions of UCS

The UCS can be divided in various ways: by plane, category, block, and so on. Unicode and ISO divide it into 17 planes of 65,536 code points each. Because the last two code points of every plane are non-characters, each plane can hold at most 65,534 distinct characters, or 1,114,078 in total. As of 2007 (Unicode 5.0), ISO and the Unicode Consortium have allocated characters and blocks in only six of the 17 planes. The others remain empty and reserved for future use.

  1. Basic Multilingual Plane (BMP). This plane contains most of the characters needed for scripts and languages in routine use in the world today. The plane is nearly full, with only approximately 3,700 of its 65,534 code points remaining to be defined.
  2. Supplementary Multilingual Plane (SMP). Currently used for many ancient scripts and characters as well as musical and mathematical notation.
  3. Supplementary Ideographic Plane (SIP). Used for ideographic characters used in many languages in China, Japan, Korea, Taiwan, Vietnam and Singapore.
  4. Supplementary Special-purpose Plane (SSP). For special-purpose characters such as compatibility control characters.
  5. Private Use Plane A. Together, the two Private Use planes provide 131,068 code points (in addition to the 6,400 private use code points provided in the BMP) for definition by organizations outside Unicode and ISO 10646. Such private use definers might be operating system vendors, font vendors, or other independent standards organizations.
  6. Private Use Plane B.
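Because every plane spans 65,536 code points, the plane of a character is simply the code point divided by 65,536. The sketch below illustrates that arithmetic; the function `plane_of` and the abbreviations in `PLANE_NAMES` (in particular "PUA-A" and "PUA-B") are labels assumed for this example.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane
PLANE_COUNT = 17

# Plane numbers of the six planes listed above (labels assumed for this sketch).
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA-A", 16: "PUA-B"}


def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) // PLANE_SIZE


total_code_points = PLANE_COUNT * PLANE_SIZE          # all code points
max_characters = PLANE_COUNT * (PLANE_SIZE - 2)       # minus two non-characters per plane

for ch in ("A", "\U0001D11E", "\U00020000"):
    number = plane_of(ch)
    print(f"U+{ord(ch):04X} is in plane {number} ({PLANE_NAMES.get(number, 'unallocated')})")
```

The example characters are the Latin letter "A", the musical symbol G clef and the first supplementary ideograph, which fall in the BMP, SMP and SIP respectively.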

By block

Unicode adds a block property to UCS that further divides each plane into separate blocks. Each block is a grouping of characters by their use such as "mathematical operators" or "Hebrew script characters". When assigning characters to previously unassigned code points, the Consortium typically allocates entire blocks of similar characters: for example all the characters belonging to the same script or all similarly purposed symbols get assigned to a single block. Blocks may also maintain unassigned or reserved code points when the Consortium expects a block to require additional assignments.
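A block lookup amounts to finding the range that contains a code point. Python's standard library does not expose the block property, so the sketch below uses a small hand-written table of three well-known blocks; the table `BLOCKS` and the function `block_of` are assumptions of this example, and a real implementation would load the full list from the Unicode Character Database.

```python
from bisect import bisect_right

# (first code point, last code point, block name), sorted by first code point.
# Hypothetical excerpt for illustration only, not the complete block list.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x2200, 0x22FF, "Mathematical Operators"),
]
_STARTS = [start for start, _end, _name in BLOCKS]


def block_of(ch: str):
    """Return the block name for a character, or None if it is not in the table."""
    cp = ord(ch)
    index = bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if index >= 0:
        _start, end, name = BLOCKS[index]
        if cp <= end:
            return name
    return None


print(block_of("\u05D0"))  # HEBREW LETTER ALEF
print(block_of("\u2211"))  # N-ARY SUMMATION
```

The lookup depends only on the range, not on whether a code point has been assigned, which matches the way a block can hold reserved code points for later additions.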

By type

UCS may also be divided according to the types of characters: script, symbol, diacritical, punctuation and so on.
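One concrete way to group characters by type is the Unicode general category, whose first letter gives a coarse class. The sketch below assumes that this property is a reasonable approximation of the types meant here; note that it does not capture the script of a character. The names `MAJOR_CLASSES` and `major_class` are invented for this example.

```python
import unicodedata

# First letter of the general category mapped to a coarse class name.
MAJOR_CLASSES = {
    "L": "letter",
    "M": "mark",        # includes combining diacritics
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",       # control, format, surrogate, private use, unassigned
}


def major_class(ch: str) -> str:
    """Return the coarse class of a character from its general category."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]


for ch in ("a", "\u0308", "7", "!", "+", " "):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {major_class(ch)}")
```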

Special code points

Among the more than one million code points available in the UCS, many are set aside for other uses or for designation by third parties. These set-aside code points include non-character code points, surrogates, and private use code points.

Non-characters

Non-character code points are set aside and guaranteed never to be used for a character. Each of the 17 planes has its last two code points set aside as non-characters. One of these, U+FFFE in the Basic Multilingual Plane, is the byte-swapped form of the byte order mark (U+FEFF). Encountering U+FFFE at the start of a text therefore indicates that the byte order of the text has been misinterpreted.
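The rule for the last two code points of each plane reduces to a bit test, and the byte order check can be reproduced by decoding a byte order mark with the wrong endianness. In this sketch the function name `is_noncharacter` is hypothetical, and the extra range U+FDD0 to U+FDEF is a further set of non-characters defined by Unicode that the text above does not mention.

```python
def is_noncharacter(cp: int) -> bool:
    """True for code points that are guaranteed never to be characters."""
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE  # xxFFFE and xxFFFF in every plane
    extra_range = 0xFDD0 <= cp <= 0xFDEF         # additional range, not covered in the text
    return last_two_of_plane or extra_range


# Write a byte order mark as little-endian UTF-16, then misread it as big-endian.
bom_bytes = "\ufeff".encode("utf-16-le")
misread = bom_bytes.decode("utf-16-be")

if is_noncharacter(ord(misread)):
    print("byte order was misinterpreted; swap the bytes and decode again")
```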

Surrogates

The UCS uses surrogates to address characters outside the Basic Multilingual Plane without resorting to code units wider than 16 bits. By combining pairs of the 2,048 surrogate code points, the remaining characters in all the other planes can be addressed (1,024 × 1,024 = 1,048,576 code points in the other 16 planes). In this way, the UCS has a built-in capability for the 16-bit encoding UTF-16.
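The pairing arithmetic can be written out in a few lines. This is a minimal sketch of the UTF-16 calculation; the function names `to_surrogates` and `from_surrogates` are chosen for this example.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point outside the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points in planes 1 to 16 need surrogates")
    offset = cp - 0x10000                # 20-bit value, 0 .. 1,048,575
    high = 0xD800 + (offset >> 10)       # top 10 bits select 1 of 1,024 high surrogates
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select 1 of 1,024 low surrogates
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D11E)       # MUSICAL SYMBOL G CLEF
print(f"U+1D11E -> {high:04X} {low:04X}")
```

The result agrees with Python's own UTF-16 codec, which produces the same two 16-bit units for this character.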

Private use

The UCS guarantees that it will never assign characters to the 137,468 private use code points. Operating system and font vendors and communities of end users may use them for their own agreed-upon purposes.
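The figure of 137,468 can be checked by adding up the three private use ranges. The range boundaries below are the standard ones; the names `PRIVATE_USE_RANGES` and `is_private_use` are assumptions of this sketch.

```python
import unicodedata

# (first, last) code point of each private use range.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # private use area in the BMP
    (0xF0000, 0xFFFFD),    # Private Use Plane A, minus its two non-characters
    (0x100000, 0x10FFFD),  # Private Use Plane B, minus its two non-characters
]


def is_private_use(cp: int) -> bool:
    """True if the code point lies in one of the private use ranges."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)


sizes = [last - first + 1 for first, last in PRIVATE_USE_RANGES]
total_private_use = sum(sizes)
print(sizes, total_private_use)

# Python reports these code points with the general category "Co" (private use).
print(unicodedata.category(chr(0xE000)))
```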

Characters, grapheme clusters and glyphs

Whereas many other character sets assign a character to every possible glyph representation of a character, Unicode seeks to treat characters separately from glyphs. This distinction is not always unambiguous, but a few examples help illustrate it. Often two or more characters are combined to improve the typographic readability of the text. For example, the three-letter sequence "ffi" may be rendered as a single glyph. Other character sets would often assign a code point to this glyph in addition to the code points for the individual letters "f" and "i".

In addition, Unicode approaches diacritic-modified letters as sequences of separate characters that, when rendered, become a single glyph: for example, an "o" with diaeresis, "ö". Traditionally, other character sets assigned a unique code point to each diacritic-modified letter used in each language. Unicode seeks a more flexible approach by allowing combining diacritic characters to combine with any letter. This has the potential to significantly reduce the number of active code points needed for the character set. As an example, consider a language that uses the Latin script and combines the diaeresis with the upper- and lower-case letters "a", "o", and "u". With the Unicode approach, only the diaeresis diacritic character needs to be added to the character set for use with the Latin letters "a", "A", "o", "O", "u", and "U": seven characters in all. A legacy character set needs to add six precomposed letters with a diaeresis in addition to the six code points it uses for the letters without diaeresis: twelve code points in total.
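The seven-versus-twelve count can be reproduced with the combining diaeresis (U+0308) and Unicode normalization. This is a sketch for illustration; the variable names are invented, and NFC normalization is used here only to obtain the precomposed forms that legacy character sets would have encoded.

```python
import unicodedata

DIAERESIS = "\u0308"     # COMBINING DIAERESIS
base_letters = "aAoOuU"

# Combining approach: each accented letter is a base letter plus the diacritic.
decomposed = [letter + DIAERESIS for letter in base_letters]

# Precomposed approach: one dedicated code point per accented letter.
precomposed = [unicodedata.normalize("NFC", pair) for pair in decomposed]

combining_inventory = set("".join(decomposed))           # 6 letters + 1 diacritic
legacy_inventory = set(base_letters) | set(precomposed)  # 6 letters + 6 precomposed

print(len(combining_inventory), len(legacy_inventory))
print(precomposed)
```

Both forms render as the same glyph, and normalizing to a common form makes them compare equal.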

Compatibility characters

UCS includes thousands of characters that Unicode designates as compatibility characters. These are characters that were included in UCS in order to provide distinct code points for characters that other character sets differentiate, but would not be differentiated in the Unicode approach to characters.

The chief reason for this differentiation is that Unicode makes a distinction between characters and glyphs. For example, when writing English in a cursive style, the letter "i" may take different forms depending on whether it appears at the beginning of a word, in the middle, at the end, or in isolation. Languages such as Arabic that are written in the Arabic script are always cursive, so each letter has several different forms. UCS includes 731 Arabic form characters that decompose to only about 100 unique Arabic characters. The 731 additional characters are included so that text processing software can translate text from other character sets to UCS and back again without losing information crucial to non-Unicode software.

However, for UCS and Unicode in particular, the preferred approach is always to encode or map a letter to the same character no matter where it appears in a word. The distinct forms of each letter are then determined by the font and the text layout software. In this way, the stored representation of a character remains identical regardless of where the character appears in a word. This greatly simplifies searching, sorting and other text processing operations.
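Compatibility characters can be mapped back to the preferred characters with compatibility normalization. The sketch below uses the "ffi" ligature and two Arabic presentation forms of the letter alef as examples; the choice of NFKC and the variable names are assumptions of this illustration.

```python
import unicodedata

ligature = "\ufb03"       # LATIN SMALL LIGATURE FFI (compatibility character)
alef_isolated = "\ufe8d"  # ARABIC LETTER ALEF ISOLATED FORM
alef_final = "\ufe8e"     # ARABIC LETTER ALEF FINAL FORM

# The decomposition property records which form a compatibility character stands for.
print(unicodedata.decomposition(alef_isolated))
print(unicodedata.decomposition(alef_final))

# Compatibility normalization folds both forms to the one preferred character.
folded_forms = {unicodedata.normalize("NFKC", ch) for ch in (alef_isolated, alef_final)}

# Searching works on the normalized text even if the source used the ligature.
legacy_text = "o" + ligature + "ce"
normalized_text = unicodedata.normalize("NFKC", legacy_text)
print("ffi" in legacy_text, "ffi" in normalized_text)
```

The normalization is lossy by design: once the forms are folded together, the original positional form can no longer be recovered from the text, which is why the compatibility characters are kept for round-tripping with other character sets.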